fix(CubeMaster): expire node health after stale nodemeta heartbeat#455
fix(CubeMaster): expire node health after stale nodemeta heartbeat#455zyl1121 wants to merge 4 commits into
Conversation
|
Claude finished @zyl1121's task in 1m 48s —— View job ✅ Code Review — PR #455Progress
Overall AssessmentThis is a clean, well-structured fix that correctly addresses the stale-node-health problem. The architecture is sound, the tests are thorough, and the prefilter hardening ( Decision: APPROVE with minor observations — no blocking issues found. Design Review1. Clean separation of concerns (
|
| Test | File | What it covers |
|---|---|---|
TestGetNodeRefreshesCurrentHealthFromCachedFacts |
export_test.go:270 |
Stale heartbeat → Healthy=false, source not mutated |
TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat |
export_test.go:309 |
Fresh vs stale node in cluster list |
TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts |
export_test.go:343 |
Health recomputation in IP-based lookup |
TestNodeConcurrentCountersUpdateCachedNodeFromReadClone |
export_test.go:371 |
Incr/Decr routing to cached node |
TestCurrentHealthStatusExpiresStaleHeartbeat |
service_resource_test.go:55 |
Fresh + stale heartbeat scenarios |
TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth |
service_resource_test.go:89 |
ReportedReady preserved, Healthy recomputed |
TestCloneSnapshotWithCurrentHealthRefreshesStaleNode |
service_resource_test.go:121 |
Cloned snapshot has correct health, source unchanged |
TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady |
service_resource_test.go:145 |
JSON serialization correctness |
TestPreFilterExcludesExpiredHeartbeatNode |
prefilter_test.go:34 |
Prefilter skips stale metadata nodes |
TestPreFilterExcludesMetricTimeoutNode |
prefilter_test.go:74 |
Prefilter skips stale metric nodes |
Missing (non-blocking): No test for GetNodes() or GetHealthyNodes() health refresh, though they use the same cloneNodeWithCurrentHealth path tested elsewhere.
Summary
The PR correctly fixes the stale heartbeat problem, is well-tested, and introduces no regressions. The design cleanly separates concerns, read paths are immutable, and the prefilter changes significantly improve scheduler robustness. No blocking issues found.
c8b8153 to
5eb0cda
Compare
|
Claude finished @zyl1121's task in 2m 45s —— View job Code Review — PR #455Progress
Overall AssessmentThis PR correctly addresses the stale-node-health problem described in #454. The architecture is clean, the design is well-isolated, and tests are thorough. The Decision: APPROVE — no blocking issues found. Observations below are minor and can be tracked separately. Design Review1.
|
| Test | File | What it covers |
|---|---|---|
TestGetNodeRefreshesCurrentHealthFromCachedFacts |
export_test.go:270 |
Stale heartbeat → Healthy=false, source not mutated |
TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat |
export_test.go:309 |
Fresh vs stale in cluster list, source not mutated |
TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts |
export_test.go:343 |
Health recomputation on IP-based lookup |
TestNodeConcurrentCountersUpdateCachedNodeFromReadClone |
export_test.go:371 |
IncrNodeConcurrent/DecrNodeConcurrent route to cached node |
TestCurrentHealthStatusExpiresStaleHeartbeat |
service_resource_test.go:55 |
Fresh + stale heartbeat scenarios |
TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth |
service_resource_test.go:89 |
ReportedReady preserved, Healthy recomputed |
TestCloneSnapshotWithCurrentHealthRefreshesStaleNode |
service_resource_test.go:121 |
Cloned snapshot health correct, source unchanged |
TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady |
service_resource_test.go:145 |
JSON serialization |
TestPreFilterExcludesExpiredHeartbeatNode |
prefilter_test.go:34 |
Prefilter skips stale metadata node |
TestPreFilterExcludesMetricTimeoutNode |
prefilter_test.go:74 |
Prefilter skips stale metric node |
TestToSchedulerNodeDoesNotForgeMetricUpdate |
service_resource_test.go:33 |
Metric fields left zero in nodemeta→scheduler conversion |
TestSyncNodeTemplatesReconcilesHeartbeatState |
export_test.go:160 |
Template reconciliation during heartbeat |
Coverage is thorough. Missing (non-blocking): no direct test for GetNodes() or GetHealthyNodes() health refresh, though both use the same cloneNodeWithCurrentHealth path.
Summary
The PR correctly fixes the stale-heartbeat problem with a clean architecture, immutable read paths, and comprehensive tests. The preloader reliability improvements (Fatalf → continue) are significant. No blocking issues.
| // Copyright (c) 2026 Tencent Inc. | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // |
| // Copyright (c) 2026 Tencent Inc. | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // | ||
|
|
There was a problem hiding this comment.
should not add this line
@fslongjin Removed the extra header comments in a follow-up commit. Thanks for catching that.
5eb0cda to
10fcd30
Compare
|
Claude finished @zyl1121's task in 1m 1s —— View job Re-review after commit
|
fslongjin
left a comment
There was a problem hiding this comment.
Thanks for this PR — the facts-vs-evaluation separation (ReportedReady vs Healthy) and the centralized nodehealth package are clean architectural improvements. I left a few inline comments on areas that could be tightened up.
Summary of inline comments
-
prefilter.go— RedundantMetaDataUpdateAttimeout check;GetHealthyNodesByInstanceTypealready expires stale heartbeats viacloneNodeWithCurrentHealth. -
node.go:Clone()— Struct copy readsLocalCreateNumnon-atomically (data race per Go memory model). -
service.go— Duplicate health evaluation inUpdateNodeStatus(line 239) andtoSchedulerNode(line 412); twotime.Now()calls on the same snapshot. -
node.go:HealthyJSON tag —omitemptyremoved; API contract change worth calling out explicitly. -
service.go:RegisterNode— ReturnscloneSnapshotwithout applying current health, inconsistent withGetNode/ListNodes/UpdateNodeStatus.
| "CalleeCluster": n.ClusterLabel, | ||
| }).Fatalf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt) | ||
| }).Warnf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt) | ||
| continue |
There was a problem hiding this comment.
🔁 Redundant heartbeat-timeout check
GetHealthyNodesByInstanceType (line 40) already clones every node with cloneNodeWithCurrentHealth, which calls nodehealth.EvaluateFromFacts using the same MetadataTimeout. A node whose heartbeat has expired will already have Healthy=false and will have been excluded from the candidate set before this loop runs.
This second MetaDataUpdateAt guard therefore duplicates what the health evaluation already does. Over time, keeping two separate timeout calculations risks the two checks drifting (e.g. someone changes how cloneNodeWithCurrentHealth computes the timeout but forgets to update this prefilter copy, or vice versa).
Suggestion: Rely solely on the n.Healthy check above (line 49). That gives a single source of truth for heartbeat-liveness decisions.
| cloned.VirtualNodeQuotaArray = append([]int64(nil), n.VirtualNodeQuotaArray...) | ||
| } | ||
| return &cloned | ||
| } |
There was a problem hiding this comment.
LocalCreateNum
The struct copy cloned := *n reads every field including LocalCreateNum (an int64), which is accessed concurrently via atomic.AddInt64 (in LocalCreateNumIncrBy) and atomic.LoadInt64 (in LocalCreateConcurrentLimit).
Per the Go memory model this is a data race and would be flagged by go test -race. In practice on amd64 it is safe (aligned 8-byte reads are atomic at the hardware level), but on 32-bit platforms (ARM) this could tear.
If this project only targets amd64, a brief comment acknowledging the deliberate non-atomic copy would help future readers (and suppress race-detector noise). Otherwise, consider using atomic.LoadInt64(&n.LocalCreateNum) for the copy.
| @@ -419,7 +438,9 @@ func toSchedulerNode(snap *NodeSnapshot) *node.Node { | |||
| OssClusterLabel: snap.ClusterLabel, | |||
There was a problem hiding this comment.
🔄 Duplicate health evaluation with UpdateNodeStatus line 239
UpdateNodeStatus already calls applyCurrentHealth(snap, time.Now()) (line 239, position 46 in this diff), which runs currentHealthStatus internally. syncLocalcache on line 241 then calls toSchedulerNode, which calls currentHealthStatus(snap, time.Now()) again here.
The two time.Now() calls are microseconds apart, so the result is effectively identical — but it is redundant work and creates a maintenance risk: if someone later extends the health-decision logic in one call site but not the other, the two representations of the same snapshot's health will diverge.
Suggestion: Have applyCurrentHealth return the nodehealth.Status it computed (e.g. func applyCurrentHealth(...) nodehealth.Status), and let toSchedulerNode consume it instead of re-deriving from scratch. This keeps a single evaluation per heartbeat update.
| UnhealthyReason string `json:"UnhealthyReason,omitempty"` | ||
|
|
||
| Score float64 `json:"Score,omitempty"` | ||
|
|
There was a problem hiding this comment.
📢 API contract change: omitempty removed
The Healthy field changed from json:"Healthy,omitempty" to json:"Healthy", meaning "healthy":false is now always present in JSON output instead of being omitted. This is deliberate (the nodemeta test confirms it), but it is a wire-format change that existing API consumers may depend on.
Consider calling this out in the PR description and the next changelog entry so integrators know to expect the new field shape.
| applyCurrentHealth(out, now) | ||
| return out | ||
| } | ||
|
|
There was a problem hiding this comment.
❓ Inconsistency: RegisterNode doesn't use this helper
cloneSnapshotWithCurrentHealth is now used by GetNode (line 300) and ListNodes (line 310), and UpdateNodeStatus calls applyCurrentHealth inline (line 239). But RegisterNode (line 199, unchanged) still returns a plain cloneSnapshot(snap) without calling applyCurrentHealth.
The caller gets Healthy=false (zero value) with an empty UnhealthyReason — rather than UnhealthyReason="HeartbeatExpired" which would be the consistent answer for a node that has never sent a heartbeat yet.
Should RegisterNode also use cloneSnapshotWithCurrentHealth for consistency?
|
Claude finished @zyl1121's task in 2m 20s —— View job Claude is analyzing the latest PR state... Progress
Analysis of @fslongjin's Review Against Current PR StateCommit 1. ✅ prefilter.go — Redundant
|
| # | Concern | Status |
|---|---|---|
| 1 | Redundant MetaDataUpdateAt check in prefilter | ✅ Fixed |
| 2 | Non-atomic Clone() LocalCreateNum read | ✅ Fixed |
| 3 | Duplicate health evaluation in toSchedulerNode | ✅ Fixed |
| 4 | JSON omitempty API contract change | 📢 Noted for docs |
| 5 | RegisterNode missing health application | ✅ Fixed |
No further code changes are needed. The PR is in good shape.
Thanks for the detailed review. I addressed the issues you found in follow-up commit, and I also updated the PR description to explicitly mention that |
|
Hi @zyl1121, thanks for this PR! The DCO check is currently failing because all 4 commits are missing the Could you please:
You can do this with an interactive rebase: git rebase -i HEAD~4
# mark each commit as 'reword', and append:
# Signed-off-by: zhengyilei <zhengyilei@myhexin.com>Or more conveniently use git rebase --signoff HEAD~4
git push --force-with-leaseOnce all commits have the sign-off, the DCO check should pass. Thanks! |
faba88d to
7ebdb38
Compare
|
Claude finished @zyl1121's task in 1m 22s —— View job PR #455 Re-trigger AnalysisProgress
Current StateThis is a re-trigger on a PR synchronize event. After examining the branch state: ✅ No new commits since last reviewThe same 4 commits reviewed previously are still present (latest: The PR is in good shape:
|
| Item | Status |
|---|---|
| Code review feedback addressed | ✅ All 5 items fixed in follow-up commits |
| PR approved by @fslongjin | ✅ |
| DCO check passing | ❌ Needs author action |
| New commits to review | None — same 4 commits as before |
Once the DCO sign-off is applied, this PR should be ready to merge.
Fixed and force-pushed the branch with signed-off commits. Thanks! |
Derive current node health from the last reported readiness fact and heartbeat freshness so stale heartbeats no longer remain healthy in nodemeta reads, localcache-backed reads, or scheduler prefilter. Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Reuse a shared helper for nodemeta heartbeat expiry so localcache, nodemeta, and prefilter keep the same timeout rule and avoid drift. Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Remove the added file header comments from the new Go files to match maintainer review feedback and keep this PR aligned with current repo expectations. Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Remove the redundant prefilter heartbeat timeout check, carry the current health state through nodemeta registration and scheduler projection, and harden read-side cloning around mutable node state. Signed-off-by: zhengyilei <zheng_yilei@qq.com>
7ebdb38 to
0aac550
Compare
|
Claude finished @zyl1121's task in 34s —— View job Progress
Analysis — PR #455 Synchronize EventCurrent StateThis is a re-trigger on a PR synchronize event. After examining the branch: ✅ No new commits since last reviewThe same 4 commits reviewed previously are still present (latest: ✅ DCO sign-off confirmedAll 4 commits now carry the The DCO check should now pass. Summary
The PR is fully ready — all review feedback has been addressed, the DCO issue has been resolved, and the PR has been approved. No further action needed from my side. |

Motivation
On the nodemeta path, CubeMaster can keep a node healthy after
cubeletstops reporting heartbeats because the last reportedReady=Truefact is reused as the current health result.This makes stale nodes look healthy in control-plane reads and can leave them eligible for scheduling decisions longer than expected.
What this PR changes
healthyresult instead of a stale cached valuehealthy=falseexplicit in API outputunhealthy_reasonforHeartbeatExpiredandReportedNotReadylocalcacheread paths refresh health from cached facts without mutating shared cached nodesGetNodesByIpTesting
go test -vet=off ./pkg/localcache ./pkg/nodemeta ./pkg/selector/prefilterManual validation:
healthy=falseunhealthy_reasondistinguishesHeartbeatExpiredandReportedNotReady/internal/meta/nodes/{id}and/internal/nodereflect the same health semanticsDocumentation
No standalone documentation update is included in this PR. However, this change does affect the JSON shape of node responses:
healthy=falseis now emitted explicitly instead of being omitted.Related issue
Fixes #454